Dataset source : https://data.gov.sg/dataset/resident-population-by-ethnicity-gender-and-age-group Dataset contains data of the number of residents in Singapore from 1957-2018 split by age and ethnicity
import pandas as pd
df = pd.read_csv("sng.csv")
df.head()
The dataset contains number of residents split by ethnicity, gender and age. Age is in 5 year brackets.
df.dtypes
df["value"] = pd.to_numeric(df["value"],errors="coerce")
df.dtypes
df.isnull().sum().sum()
There are 2460 null values
df_nan = df[df.isna().any(axis=1)]
df_nan["level_2"].unique()
Since all the null values occur above the age of 70, it is safe to assume that null value means the value is 0. This is based on the assumption that if the null values occured due to errors in reporting or data entry they would be distributed randomly.
df = df.fillna(0)
df.iloc[344]
Replaced null values with 0.
Identifying the different ethnic groups:
df["level_1"].unique()
Isolating the entries of the year 2018(last year data was recorded)
df_2018 = df[df["year"]==2018]
df_2018.head()
Only considering total number of residents of each ethnic group
df_2018_total = df_2018[df_2018["level_1"].isin(["Total Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])].reset_index(drop=True)
df_2018_total_grp = df_2018_total.groupby(["year","level_1"],as_index=False)["value"].sum()
df_2018_sorted = df_2018_total_grp.sort_values("value")
df_2018_sorted.reset_index(drop=True, inplace=True)
df_2018_sorted
percentage_chinese = (df_2018_sorted.iloc[3,2]/df_2018_sorted.iloc[4,2])*100
percentage_chinese
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.express as px
df_2018_sorted_mod = df_2018_sorted[df_2018_sorted.index != 4]
fig = px.pie(df_2018_sorted_mod, values='value', names='level_1', title='Population distribution of Singapore in 2018 by ethnicity')
fig.show()
Analyzing the growth of the chinese population in Singapore
df_chinese = df[df['level_1'].isin(["Total Chinese","Total Male Chinese", "Total Female Chinese"])].reset_index(drop=True)
df_chinese_grp = df_chinese.groupby(["year","level_1"],as_index=False)["value"].sum()
df_chinese_grp.head()
df_chinese = df_chinese_grp.pivot_table(values="value",index="year",columns="level_1")
df_chinese_mod = df_chinese.reset_index( drop=False, inplace=False )
df_chinese_mod.columns.names = [""]
df_chinese_mod.head()
population growth of a year = population in current year - population in previous year
growth_df = df_chinese.diff(axis=0)
growth_df.reset_index( drop=False, inplace=True )
growth_df.columns.names = [""]
growth_df = growth_df[growth_df.year != 1957]
growth_df.head()
This dataframe shows population growth each year
avg_m_chinese = growth_df['Total Male Chinese'].mean()
avg_f_chinese = growth_df['Total Female Chinese'].mean()
avg_chinese = growth_df['Total Chinese'].mean()
print(avg_m_chinese, avg_f_chinese, avg_chinese)
The average year-over-year population growth for chinese male residents was 21466 residents per year
The average year-over-year population growth for chinese female residents was 25061 residents per year
The average year-over-year population growth for total chinese residents was 46527 residents per year
import plotly.graph_objs as go
trace1 = go.Scatter(
x = df_chinese_mod.year,
y = df_chinese_mod['Total Chinese'],
mode = "lines",
name = "Chinese population",
)
trace2 = go.Scatter(
x = df_chinese_mod.year,
y = df_chinese_mod['Total Male Chinese'],
mode = "lines",
name = "Male Chinese population",
)
trace3 = go.Scatter(
x = df_chinese_mod.year,
y = df_chinese_mod['Total Female Chinese'],
mode = "lines",
name = "Female Chinese population",
)
data = [trace1,trace2,trace3]
layout = dict(title = 'Chinese population in Singapore',
xaxis= dict(title= 'Year'),
yaxis=dict(title= 'Number of Chinese residents')
)
fig = go.Figure(dict(data = data, layout = layout))
fig.show()
From this graph we can see that chinese population in Singapore has consistently increased since 1957, with the male and female populations exactly equal since 1980. Since then Chinese Female residents are higher in number.
import plotly.graph_objs as go
trace1 = go.Scatter(
x = growth_df.year,
y = growth_df['Total Chinese'],
mode = "lines",
name = "Chinese population growth",
)
trace2 = go.Scatter(
x = growth_df.year,
y = growth_df['Total Male Chinese'],
mode = "lines",
name = "Male Chinese population growth",
)
trace3 = go.Scatter(
x = growth_df.year,
y = growth_df['Total Female Chinese'],
mode = "lines",
name = "Female Chinese population growth",
)
data = [trace1,trace2,trace3]
layout = dict(title = 'Population growth per year',
xaxis= dict(title= 'Year'),
yaxis=dict(title= 'Population growth per year')
)
fig = go.Figure(dict(data = data, layout = layout))
fig.show()
Since the mid 1980s the female chinese population growth is slightly more than the male chinese population growth. There was a huge spike in chinese residents in 1990, probably due to an external policy change. The chinese population in Singapore has grown almost every year since 1958 and declined only in 2003.
df_2018_age_grp = df_2018.groupby(["year","level_2"],as_index=False)["value"].sum()
df_2018_age_grp.head()
df_2018_age_grp.rename({"level_2":"age_bracket"},axis=1, inplace=True)
df_2018_age_grp["age_bracket"].unique()
Removing data stored in the format "x Years & Over". This is because the same data is already stored in the corresponding age bracket, leading to doublecounting the data if not removed.
df_2018_age_grp = df_2018_age_grp[~df_2018_age_grp["age_bracket"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
df_2018_age_sorted = df_2018_age_grp.sort_values("value",ascending=False)
df_2018_age_sorted.reset_index(drop=True, inplace=True)
df_2018_age_sorted.head()
fig = px.pie(df_2018_age_sorted, values='value', names='age_bracket', title='Population distribution of Singapore in 2018 (by age)')
fig.show()
Most residents in Singapore are middle aged, with the population split fairly evenly between the age of 30 and 60
df_age = df[df['level_2']=="50 - 54 Years"].reset_index(drop=True)
df_age = df_age.groupby(["year","level_2"],as_index=False)["value"].sum()
df_age.drop("level_2",axis=1,inplace=True)
df_age.head()
diff_age = df_age["value"].diff()
age_growth_df = pd.DataFrame()
age_growth_df["year"] = df_age["year"]
age_growth_df["growth"] = diff_age
age_growth_df = age_growth_df[age_growth_df.year != 1957]
age_growth_df.head()
avg_growth = age_growth_df["growth"].mean()
print(avg_growth)
import plotly.express as px
fig = px.line(age_growth_df, x='year', y='growth', title="Population growth vs Time in years of 50-54 year old Singaporean residents")
fig.show()
The population growth seems to be pretty erratic with several peaks and dips. There is a prominent spike in the year 2000. The population of 50-54 year old residents have been declining since 2016. This has happenned twice in the past - at 1995 and at 1970
import plotly.express as px
fig = px.line(df_age, x='year', y='value', title="Population vs Time in years of 50-54 year old Singaporean residents")
fig.show()
The general trend is an increase in the population of 50-54 year olds but the trend seems to have reversed in recent years.
In order to find group with highest growth rate we need to compare the data from 1958 and 2018
df_1958 = df[df["year"]==1958].reset_index(drop=True)
df_1958.head()
df_2018.reset_index(drop=True, inplace=True)
df_2018.head()
pop_change_df = pd.DataFrame()
pop_change_df[["type","age","1958"]] = df_1958[["level_1","level_2","value"]]
pop_change_df["2018"] = df_2018["value"]
pop_change_df["change"] = df_2018["value"]-df_1958["value"]
pop_change_df.head()
Average growth is (population change)/number of years,
pop_change_df["growth_rate"] = pop_change_df["change"]/60
pop_change_df.head()
print(pop_change_df["growth_rate"].max())
print(pop_change_df["growth_rate"].idxmax())
pop_change_df.iloc[17]
The 65 Years and Over sector has seen the highest growth rate from 1958 to 2018
Removing aggregate categories,
pop_change_df = pop_change_df[~pop_change_df["age"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
pop_change_df = pop_change_df[~pop_change_df["type"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
pop_change_df.head()
pop_change_df.sort_values("growth_rate", ascending=False).head(3).reset_index(drop=True)
pop_change_df[(pop_change_df['growth_rate'] >= -0.5) & (pop_change_df['growth_rate'] <= 0.5)]
df_trends = df[df["level_1"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
df_trends = df_trends.rename(columns = {"value":"total_population"})
df_trends["total_population"].describe()
df_trends = df_trends.groupby(["year","level_1"],as_index=False)["total_population"].sum()
df_trends = df_trends.rename(columns={"level_1":"ethnicity"})
df_trends.head()
fig = px.scatter(df_trends, x="year", y="total_population", color="ethnicity", trendline="ols")
fig.show()
Removing aggregates
df_age = df[~df["level_1"].isin(["Total Residents","Total Male Residents","Total Female Residents","Total Malays","Total Chinese","Total Indians","Other Ethnic Groups (Total)"])]
df_age = df_age[~df_age["level_2"].isin(["65 Years & Over","70 Years & Over","75 Years & Over", "80 Years & Over","85 Years & Over"])]
df_age = df_age.rename(columns={"level_1":"ethnicity", "level_2":"age", "value":"population"})
fig = px.bar(df_age, x="age", y="population", color="ethnicity",
animation_frame="year", range_y=[0,1000000])
fig.show()
It can be observed how Singapore's population was initially skewed to younger ages. However, as time passed and Singapore went from a third-world country to a first-world country, the middle aged population exploded. This can be attributed to the large number of foreigners who migrated to Singapore in search of a better quality of life and financial opportunities.